Abstract

We introduce a new online convex optimization algorithm that adaptively chooses its regulariza- 
tion function based on the loss functions observed so far. This is in contrast to previous algorithms that 
use a fixed regularization function such as L2-squared, and modify it only via a single time-dependent 
parameter. Our algorithm's regret bounds are worst-case optimal, and for certain realistic classes of 
loss functions they are much better than existing bounds. These bounds are problem-dependent, 
which means they can exploit the structure of the actual problem instance. Critically, however, our 
algorithm does not need to know this structure in advance. Rather, we prove competitive guarantees 
that show the algorithm provides a bound within a constant factor of the best possible bound (of a 
certain functional form) in hindsight. 



1 Introduction 

We consider online convex optimization in the full information feedback setting. A closed, bounded
convex feasible set $\mathcal{F} \subseteq \mathbb{R}^n$ is given as input, and on each round
$t = 1, \dots, T$, we must pick a point $x_t \in \mathcal{F}$. A convex loss function $f_t$ is then
revealed, and we incur loss $f_t(x_t)$. Our regret at the end of $T$ rounds is

\[ \text{Regret} \equiv \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{F}} \sum_{t=1}^{T} f_t(x). \tag{1} \]
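As a small illustration of Equation (1), the following sketch computes regret in a special case that is an assumption of this example, not part of the definition: the losses are linear, $f_t(x) = g_t \cdot x$, and the feasible set is a hyperrectangle, so the best fixed point in hindsight can be found coordinate by coordinate. The function name `regret_linear` is hypothetical.

```python
def regret_linear(points, grads, lo, hi):
    """Regret of the played points against the best fixed point in the box
    prod_i [lo[i], hi[i]], assuming linear losses f_t(x) = g_t . x."""
    # Total loss actually incurred: sum_t g_t . x_t.
    incurred = sum(
        sum(gi * xi for gi, xi in zip(g, x)) for g, x in zip(grads, points)
    )
    # Summed gradient g_{1:T}; the hindsight minimizer of a linear function
    # over a box sits at a corner chosen per coordinate.
    n = len(lo)
    total = [sum(g[i] for g in grads) for i in range(n)]
    best = sum(t * (l if t > 0 else h) for t, l, h in zip(total, lo, hi))
    return incurred - best
```

For example, playing the origin twice against gradients $(1, -1)$ and $(1, 1)$ on $[-1, 1]^2$ incurs zero loss, while the best fixed point achieves $-2$.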

Existing algorithms for online convex optimization are worst-case optimal in terms of certain
fundamental quantities. In particular, online gradient descent attains a bound of $O(DM\sqrt{T})$
where $D$ is the $L_2$ diameter of the feasible set and $M$ is a bound on the $L_2$-norm of the
gradients of the loss functions. This bound is tight in the worst case, in that it is possible to
construct problems where this much regret is inevitable. However, this does not mean that an
algorithm that achieves this bound is optimal in a practical sense, as on easy problem instances
such an algorithm is still allowed to incur the worst-case regret. In particular, although this
bound is minimax optimal when the feasible set is a hypersphere [Abernethy et al., 2008], we will
see that much better algorithms exist when the feasible set is the hypercube.

To improve over the existing worst-case guarantees, we introduce additional parameters that capture
more of the problem's structure. These parameters depend on the loss functions, which are not known
in advance. To address this, we first construct functional upper bounds on regret
$B_R(\theta_1, \dots, \theta_T; f_1, \dots, f_T)$ that depend on both (properties of) the loss
functions $f_t$ and algorithm parameters $\theta_t$. We then give algorithms for choosing the
parameters $\theta_t$ adaptively (based only on $f_1, f_2, \dots, f_{t-1}$) and prove that these
adaptive schemes provide a regret bound that is only a constant factor worse than the best possible
regret bound of the form $B_R$. Formally, if for all possible function sequences $f_1, \dots, f_T$
we have

\[ B_R(\theta_1, \dots, \theta_T; f_1, \dots, f_T) \le \kappa \inf_{\theta'_1, \dots, \theta'_T} B_R(\theta'_1, \dots, \theta'_T; f_1, \dots, f_T) \]

for the adaptively-selected $\theta_t$, we say the adaptive scheme is $\kappa$-competitive for the
bound optimization problem. In Section 1.2 we provide realistic examples where known bounds are much
worse than the problem-dependent bounds obtained by our algorithm.



1.1 Follow the proximally-regularized leader 

We analyze a follow the regularized leader (FTRL) algorithm that adaptively selects regularization
functions of the form

\[ r_t(x) = \frac{1}{2} \left\| Q_t^{1/2} (x - x_t) \right\|_2^2 \]

where $Q_t$ is a positive semidefinite matrix. Our algorithm plays $x_1 = 0$ on round 1 (we assume
without loss of generality that $0 \in \mathcal{F}$), and on round $t+1$, selects the point

\[ x_{t+1} = \arg\min_{x \in \mathcal{F}} \sum_{\tau=1}^{t} \bigl( r_\tau(x) + f_\tau(x) \bigr). \tag{2} \]



In contrast to other FTRL algorithms, such as the dual averaging method of Xiao [2009], we center
the additional regularization at the current feasible point $x_t$ rather than at the origin.
Accordingly, we call this algorithm follow the proximally-regularized leader (FTPRL). This proximal
centering of additional regularization is similar in spirit to the optimization solved by online
gradient descent (and more generally, online mirror descent [Cesa-Bianchi and Lugosi, 2006]).
However, rather than considering only the current gradient, our algorithm considers the sum of all
previous gradients, and so solves a global rather than local optimization on each round. We discuss
related work in more detail in Section 4.

The FTPRL algorithm allows a clean analysis from first principles, which we present in Section 2.
The proof techniques are rather different from those used for online gradient descent algorithms,
and will likely be of independent interest.

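We write $\vec{Q}_T$ as shorthand for $(Q_1, Q_2, \dots, Q_T)$, with $\vec{g}_T$ defined
analogously. For a convex set $\mathcal{F}$, we define
$\mathcal{F}_{\text{sym}} = \{x - x' \mid x, x' \in \mathcal{F}\}$. Using this notation, we can
state our regret bound as

\[ \text{Regret} \le B_R(\vec{Q}_T, \vec{g}_T) \equiv \frac{1}{2} \sum_{t=1}^{T} \max_{\hat{y} \in \mathcal{F}_{\text{sym}}} \bigl( \hat{y}^\top Q_t \hat{y} \bigr) + \sum_{t=1}^{T} g_t^\top Q_{1:t}^{-1} g_t \tag{3} \]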

where $g_t$ is a subgradient of $f_t$ at $x_t$ and $Q_{1:t} = \sum_{\tau=1}^{t} Q_\tau$. We prove
competitive ratios with respect to this $B_R$ for several adaptive schemes for selecting the $Q_t$
matrices. In particular, when the FTPRL-Diag scheme is run on a hyperrectangle (a set of the form
$\{x \mid x_i \in [a_i, b_i]\} \subseteq \mathbb{R}^n$), we achieve

\[ \text{Regret} \le \sqrt{2} \inf_{\vec{Q}_T \in \mathcal{Q}_{\text{diag}}^T} B_R(\vec{Q}_T, \vec{g}_T) \]

where $\mathcal{Q}_{\text{diag}} = \{\text{diag}(\lambda_1, \dots, \lambda_n) \mid \lambda_i \ge 0\}$.
When the FTPRL-Scale scheme is run on a feasible set of the form
$\mathcal{F} = \{x \mid \|Ax\|_2 \le 1\}$ for $A \in S^n_{++}$, it is competitive with arbitrary
positive semidefinite matrices:

\[ \text{Regret} \le \sqrt{2} \inf_{\vec{Q}_T \in (S^n_+)^T} B_R(\vec{Q}_T, \vec{g}_T). \]

Our analysis of FTPRL reveals a fundamental connection between the shape of the feasible set and
the importance of choosing the regularization matrices adaptively. When the feasible set is a
hyperrectangle, FTPRL-Diag has stronger bounds than known algorithms, except for degenerate cases
where the bounds are identical. In contrast, when the feasible set is a hypersphere,
$\{x \mid \|x\|_2 \le 1\}$, the bound $B_R$ is always optimized by choosing $Q_t = \lambda_t I$ for
suitable $\lambda_t \in \mathbb{R}$. The FTPRL-Scale scheme extends this result to hyperellipsoids
by applying a suitable transformation. These results are presented in detail in Section 3.



1.2 The practical importance of adaptive regularization 



In the past few years, online algorithms have emerged as state-of-the-art techniques for solving
large-scale machine learning problems [Bottou and Bousquet, 2008, Zhang, 2004]. Two canonical
examples of such large-scale learning problems are text classification on large datasets and
predicting click-through rates for ads on a search engine. For such problems, extremely large
feature sets may be considered, but many features only occur rarely, while few occur very often.
Our diagonal-adaptation algorithm offers improved bounds for problems such as these.

As an example, suppose $\mathcal{F} = [-\frac{1}{2}, \frac{1}{2}]^n$ (so $D = \sqrt{n}$). On each
round $t$, the $i$th component of $\nabla f_t(x_t)$ (henceforth $g_{t,i}$) is 1 with probability
$i^{-\alpha}$, and is 0 otherwise, for some $\alpha \in [1, 2)$. Such heavy-tailed distributions
are common in text classification applications, where there is a feature for each word. In this
case, gradient descent with a global learning rate¹ obtains an expected regret bound of
$O(\sqrt{nT})$. In contrast, the algorithms presented in this paper will obtain expected regret on
the order of

\[ \mathbb{E}\left[ \sum_{i=1}^{n} \sqrt{\sum_{t=1}^{T} g_{t,i}^2} \right] \le \sum_{i=1}^{n} \sqrt{\sum_{t=1}^{T} \mathbb{E}\bigl[ g_{t,i}^2 \bigr]} = \sum_{i=1}^{n} \sqrt{T\, i^{-\alpha}} = O\bigl( \sqrt{T} \cdot n^{1 - \alpha/2} \bigr) \]

using Jensen's inequality. This bound is never worse than the $O(\sqrt{nT})$ bound achieved by
ordinary gradient descent, and can be substantially better. For example, in problems where a
constant fraction of examples contain a new feature, $n$ is $\Omega(T)$ and the bound for ordinary
gradient descent is vacuous. In contrast, the bound for our algorithm is $O(T^{(3-\alpha)/2})$,
which is sublinear for $\alpha > 1$.
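To get a feel for the gap, the sketch below evaluates the two expressions from this example with all constant factors dropped: $\sqrt{nT}$ for a global learning rate, and the Jensen upper bound $\sum_i \sqrt{T i^{-\alpha}}$ for per-coordinate adaptation. The function names are hypothetical, and the numbers indicate orders of magnitude only, not actual regret.

```python
import math

def global_rate_bound(n, T):
    # Order of the bound for gradient descent with a global learning rate.
    return math.sqrt(n * T)

def per_coordinate_bound(n, T, alpha):
    # Jensen upper bound on the expected per-coordinate bound:
    # sum over coordinates i = 1..n of sqrt(T * i^(-alpha)).
    return sum(math.sqrt(T * i ** (-alpha)) for i in range(1, n + 1))

if __name__ == "__main__":
    T = 10_000
    n = T  # regime where the number of features grows linearly with T
    for alpha in (1.0, 1.5, 1.9):
        print(alpha, per_coordinate_bound(n, T, alpha), global_rate_bound(n, T))
```

With $n = T$, the global-rate expression grows linearly in $T$, while the per-coordinate expression grows like $T^{(3-\alpha)/2}$.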

This performance difference is not merely a weakness in the regret bounds for ordinary gradient
descent, but is a difference in actual regret. In concurrent work [Streeter and McMahan, 2010], we
showed that for some problem families, a per-coordinate learning rate for online gradient descent
provides asymptotically less regret than even the best non-increasing global learning rate (chosen
in hindsight, given the observed loss functions). This construction can be adapted to FTPRL as:

Theorem 1. There exists a family of online convex optimization problems, parametrized by the number
of rounds $T$, where online subgradient descent with a non-increasing learning rate sequence (and
FTPRL with non-decreasing coordinate-constant regularization) incurs regret at least
$\Omega(T^{2/3})$, whereas FTPRL with appropriate diagonal regularization matrices $Q_t$ has regret
$O(\sqrt{T})$.



In fact, any online learning algorithm whose regret is $O(MD\sqrt{T})$ (where $D$ is the $L_2$
diameter of the feasible region, and $M$ is a bound on the $L_2$ norm of the gradients) will suffer
regret $\Omega(T^{2/3})$ on this family of problems. Note that this does not contradict the
$O(MD\sqrt{T})$ upper bound on the regret, because in this family of problems $D = T^{1/6}$ (and
$M = 1$).

¹ The $O(DM\sqrt{T})$ bound (mentioned in the introduction) based on a $1/\sqrt{t}$ learning rate
gives $O(n\sqrt{T})$ here; to get $O(\sqrt{nT})$ a global rate based on $\|g_t\|_2$ is needed,
e.g., Corollary 1.

1.3 Adaptive algorithms and competitive ratios 

In Section 3 we introduce specific schemes for selecting the regularization matrices $Q_t$ for
FTPRL, and show that for certain feasible sets, these algorithms provide bounds within a constant
factor of those for the best post-hoc choice of matrices, namely

\[ \inf_{\vec{Q}_T \in \mathcal{Q}^T} B_R(\vec{Q}_T, \vec{g}_T) \tag{4} \]

where $\mathcal{Q} \subseteq S^n_+$ is a set of allowed matrices; $S^n_+$ is the set of symmetric
positive semidefinite $n \times n$ matrices, with $S^n_{++}$ the corresponding set of symmetric
positive definite matrices. We consider three different choices for $\mathcal{Q}$: the set of
coordinate-constant matrices $\mathcal{Q}_{\text{const}} = \{\alpha I \mid \alpha \ge 0\}$; the set
of non-negative diagonal matrices,

\[ \mathcal{Q}_{\text{diag}} = \{\text{diag}(\lambda_1, \dots, \lambda_n) \mid \lambda_i \ge 0\}; \]

and the full set of positive semidefinite matrices, $\mathcal{Q}_{\text{full}} = S^n_+$.

We first consider the case where the feasible region is an $L_p$ unit ball, namely
$\mathcal{F} = \{x \mid \|x\|_p \le 1\}$. For $p \in [1, 2]$, we show that a simple algorithm (an
analogue of standard online gradient descent) that selects matrices from
$\mathcal{Q}_{\text{const}}$ is $\sqrt{2}$-competitive with the best post-hoc choice of matrices
from the full set of positive semidefinite matrices $\mathcal{Q}_{\text{full}} = S^n_+$. This
algorithm is presented in Corollary 1, and the competitive ratio is proved in Theorem 6.

In contrast to the result for $p \in [1, 2]$, we show that for $L_p$ balls with $p > 2$ a
coordinate-independent choice of matrices ($Q_t \in \mathcal{Q}_{\text{const}}$) does not in general
obtain the post-hoc optimal bound (see Section 3.3), and hence per-coordinate adaptation can help.
The benefit of per-coordinate adaptation is most pronounced for the $L_\infty$-ball, where the
coordinates are essentially independent. In light of this, we develop an efficient algorithm
(FTPRL-Diag, Algorithm 1) for adaptively selecting $Q_t$ from $\mathcal{Q}_{\text{diag}}$, which
uses scaling based on the width of $\mathcal{F}$ in the coordinate directions (Corollary 2). In this
corollary, we also show that this algorithm is $\sqrt{2}$-competitive with the best post-hoc choice
of matrices from $\mathcal{Q}_{\text{diag}}$ when the feasible set is a hyperrectangle.

While per-coordinate adaptation does not help for the unit $L_2$-ball, it can help when the feasible
set is a hyperellipsoid. In particular, in the case where $\mathcal{F} = \{x \mid \|Ax\|_2 \le 1\}$
for $A \in S^n_{++}$, we show that an appropriate transformation of the problem can produce
significantly better regret bounds. More generally, we show (see Theorem 5) that if one has a
$\kappa$-competitive adaptive FTPRL scheme for the feasible set $\{x \mid \|x\| \le 1\}$ for an
arbitrary norm, it can be extended to provide a $\kappa$-competitive algorithm for feasible sets of
the form $\{x \mid \|Ax\| \le 1\}$. Using this result, we can show FTPRL-Scale is
$\sqrt{2}$-competitive with the best post-hoc choice of matrices from $S^n_+$ when
$\mathcal{F} = \{x \mid \|Ax\|_2 \le 1\}$ and $A \in S^n_{++}$; it is $\sqrt{2}$-competitive with
$\mathcal{Q}_{\text{diag}}$ when $\mathcal{F} = \{x \mid \|Ax\|_p \le 1\}$ for $p \in [1, 2)$.

Of course, in many practical applications the feasible set may not be so nicely characterized. We
emphasize that our algorithms and analysis are applicable to arbitrary feasible sets, but the
quality of the bounds and competitive ratios will depend on how tightly the feasible set can be
approximated by a suitably chosen transformed norm ball. In Theorem 3 we show in particular that
when FTPRL-Diag is applied to an arbitrary feasible set, it provides a competitive guarantee related
to the ratio of the widths of the smallest hyperrectangle that contains $\mathcal{F}$ to the largest
hyperrectangle contained in $\mathcal{F}$.

1.4 Notation and technical background

We use the notation $g_{1:t}$ as a shorthand for $\sum_{\tau=1}^{t} g_\tau$. Similarly we write
$Q_{1:t}$ for a sum of matrices $Q_t$, and $f_{1:t}$ to denote the function
$f_{1:t}(x) = \sum_{\tau=1}^{t} f_\tau(x)$. We write $x^\top y$ or $x \cdot y$ for the inner product
between $x, y \in \mathbb{R}^n$. The $i$th entry in a vector $x$ is denoted $x_i \in \mathbb{R}$;
when we have a sequence of vectors $x_t \in \mathbb{R}^n$ indexed by time, the $i$th entry is
$x_{t,i} \in \mathbb{R}$. We use $\partial f(x)$ to denote the set of subgradients of $f$ evaluated
at $x$.

Recall $A \in S^n_{++}$ means $\forall x \ne 0$, $x^\top A x > 0$. We use the generalized
inequality $A \succ 0$ when $A \in S^n_{++}$, and similarly $A \prec B$ when $B - A \succ 0$,
implying $x^\top A x < x^\top B x$. We define $A \preceq B$ analogously for symmetric positive
semidefinite matrices $S^n_+$. For $B \in S^n_+$, we write $B^{1/2}$ for the square root of $B$, the
unique $X \in S^n_+$ such that $XX = B$ (see, for example, Boyd and Vandenberghe [2004, A.5.2]). We
also make use of the fact that any $A \in S^n_+$ can be factored as $A = P D P^\top$ where
$P^\top P = I$ and $D = \text{diag}(\lambda_1, \dots, \lambda_n)$ where $\lambda_i$ are the
eigenvalues of $A$.



Following the arguments of Zinkevich [2003], for the remainder we restrict our attention to linear
functions. Briefly, the convexity of $f_t$ implies $f_t(x) \ge g_t^\top (x - x_t) + f_t(x_t)$, where
$g_t \in \partial f(x_t)$. Because this inequality is tight for $x = x_t$, it follows that regret
measured against the affine functions on the right hand side is an upper bound on true regret.
Furthermore, regret is unchanged if we replace this affine function with the linear function
$g_t^\top x$. Thus, so long as our algorithm only makes use of the subgradients $g_t$, we may assume
without loss of generality that the loss functions are linear.

Taking into account this reduction and the functional form of the $r_t$, the update of FTPRL is

\[ x_{t+1} = \arg\min_{x \in \mathcal{F}} \left( \frac{1}{2} \sum_{\tau=1}^{t} (x - x_\tau)^\top Q_\tau (x - x_\tau) + g_{1:t} \cdot x \right). \tag{5} \]



2 Analysis of FTPRL 

In this section, we prove the following bound on the regret of FTPRL for an arbitrary sequence of
regularization matrices $Q_t$. In this section $\|\cdot\|$ always means the $L_2$ norm,
$\|\cdot\|_2$.

Theorem 2. Let $\mathcal{F} \subseteq \mathbb{R}^n$ be a closed, bounded convex set with
$0 \in \mathcal{F}$. Let $Q_1 \in S^n_{++}$, and $Q_2, \dots, Q_T \in S^n_+$. Define
$r_t(x) = \frac{1}{2} \|Q_t^{1/2}(x - x_t)\|_2^2$, and $A_t = (Q_{1:t})^{1/2}$. Let $f_t$ be a
sequence of loss functions, with $g_t \in \partial f_t(x_t)$ a subgradient of $f_t$ at $x_t$. Then,
the FTPRL algorithm that faces loss functions $f$, plays $x_1 = 0$, and uses the update of
Equation (5) thereafter, has a regret bound

\[ \text{Regret} \le r_{1:T}(\mathring{x}) + \sum_{t=1}^{T} \|A_t^{-1} g_t\|^2 \]

where $\mathring{x} = \arg\min_{x \in \mathcal{F}} f_{1:T}(x)$ is the post-hoc optimal feasible
point.

To prove Theorem 2 we will make use of the following bound on the regret of FTRL, which holds for
arbitrary (possibly non-convex) loss functions. This lemma can be proved along the lines of
[Kalai and Vempala, 2005]; for completeness, a proof is included in Appendix A.

Lemma 1. Let $r_1, r_2, \dots, r_T$ be a sequence of non-negative functions. The regret of FTPRL
(which plays $x_t$ as defined by Equation (2)) is bounded by

\[ r_{1:T}(\mathring{x}) + \sum_{t=1}^{T} \bigl( f_t(x_t) - f_t(x_{t+1}) \bigr) \]

where $\mathring{x}$ is the post-hoc optimal feasible point.

Once Lemma 1 is established, to prove Theorem 2 it suffices to show that for all $t$,

\[ f_t(x_t) - f_t(x_{t+1}) \le \|A_t^{-1} g_t\|^2. \tag{6} \]

To show this, we first establish an alternative characterization of our algorithm as solving an
unconstrained optimization followed by a suitable projection onto the feasible set. Define the
projection operator,

\[ P_{\mathcal{F},A}(u) = \arg\min_{x \in \mathcal{F}} \|A(x - u)\|. \]

We will show that the following is an equivalent formula for $x_t$:

\[ u_{t+1} = \arg\min_{u \in \mathbb{R}^n} \bigl( r_{1:t}(u) + g_{1:t} \cdot u \bigr), \qquad x_{t+1} = P_{\mathcal{F},A_t}(u_{t+1}). \tag{7} \]

This characterization will be useful, because the unconstrained solutions depend only on the linear
functions $g_t$ and the quadratic regularization, and hence are easy to manipulate in closed form.
To show this equivalence, first note that because $Q_t \in S^n_+$ is symmetric,

\[ r_t(u) = \frac{1}{2} (u - x_t)^\top Q_t (u - x_t) = \frac{1}{2} u^\top Q_t u - x_t^\top Q_t u + \frac{1}{2} x_t^\top Q_t x_t. \]

Defining constants $q_t = Q_t x_t$ and $k_t = \frac{1}{2} x_t^\top Q_t x_t$, we can write

\[ r_{1:t}(u) = \frac{1}{2} u^\top Q_{1:t} u - q_{1:t} \cdot u + k_{1:t}. \tag{8} \]

The equivalence is then a corollary of the following lemma, choosing $Q = Q_{1:t}$ and
$h = g_{1:t} - q_{1:t}$ (note that the constant term $k_{1:t}$ does not influence the argmin).

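Lemma 2. Let $Q \in S^n_{++}$ and $h \in \mathbb{R}^n$, and consider the function

\[ f(x) = h^\top x + \frac{1}{2} x^\top Q x. \]

Let $\mathring{u} = \arg\min_{u \in \mathbb{R}^n} f(u)$. Then, letting $A = Q^{1/2}$, we have
$P_{\mathcal{F},A}(\mathring{u}) = \arg\min_{x \in \mathcal{F}} f(x)$.

Proof. Note that $\nabla_u f(u) = h + Qu$, implying that $\mathring{u} = -Q^{-1} h$. Consider the
function

\[ f'(x) = \frac{1}{2} \|Q^{1/2}(x - \mathring{u})\|^2 = \frac{1}{2} (x - \mathring{u})^\top Q (x - \mathring{u}). \]

We have

\begin{align*}
f'(x) &= \frac{1}{2} \bigl( x^\top Q x - 2 x^\top Q \mathring{u} + \mathring{u}^\top Q \mathring{u} \bigr) && \text{(because $Q$ is symmetric)} \\
&= \frac{1}{2} \bigl( x^\top Q x + 2 x^\top Q (Q^{-1}) h + \mathring{u}^\top Q \mathring{u} \bigr) \\
&= \frac{1}{2} \bigl( x^\top Q x + 2 x^\top h + \mathring{u}^\top Q \mathring{u} \bigr) \\
&= f(x) + \frac{1}{2} \mathring{u}^\top Q \mathring{u}.
\end{align*}

Because $\frac{1}{2} \mathring{u}^\top Q \mathring{u}$ is constant with respect to $x$, it follows
that

\[ \arg\min_{x \in \mathcal{F}} f(x) = \arg\min_{x \in \mathcal{F}} f'(x) = P_{\mathcal{F},A}(\mathring{u}), \]

where the last equality follows from the definition of the projection operator. □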
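Lemma 2 can be checked numerically in a simple special case. The sketch below assumes a diagonal $Q$ and a two-dimensional hyperrectangle for $\mathcal{F}$ (both assumptions of this example). In that case the projection $P_{\mathcal{F},A}$ decouples per coordinate and reduces to clipping, and a brute-force grid search over the box serves as an independent reference for the constrained minimizer. All function names are hypothetical.

```python
def objective(x, h, q):
    # f(x) = h.x + (1/2) x^T Q x with Q = diag(q).
    return sum(hi * xi + 0.5 * qi * xi * xi for hi, qi, xi in zip(h, q, x))

def unconstrained_min(h, q):
    # u = -Q^{-1} h for diagonal Q.
    return [-hi / qi for hi, qi in zip(h, q)]

def project_box(u, lo, hi):
    # For diagonal A and a box, argmin_x ||A(x - u)|| is coordinate-wise clipping.
    return [min(max(ui, l), h) for ui, l, h in zip(u, lo, hi)]

def grid_argmin_2d(h, q, lo, hi, steps=200):
    # Brute-force reference: evaluate f on a regular grid over the 2-D box.
    best, best_x = None, None
    for a in range(steps + 1):
        for b in range(steps + 1):
            x = [lo[0] + (hi[0] - lo[0]) * a / steps,
                 lo[1] + (hi[1] - lo[1]) * b / steps]
            val = objective(x, h, q)
            if best is None or val < best:
                best, best_x = val, x
    return best_x
```

For instance, with `h = [1.0, -3.0]`, `q = [2.0, 1.0]` and the box $[-1, 1]^2$, projecting the unconstrained minimizer and running the grid search should agree up to the grid resolution.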



We now derive a closed-form solution to the unconstrained problem. It is easy to show
$\nabla r_t(u) = Q_t u - Q_t x_t$, and so

\[ \nabla r_{1:t}(u) = Q_{1:t} u - \sum_{\tau=1}^{t} Q_\tau x_\tau. \]

Because $u_{t+1}$ is the optimum of the (strongly convex) unconstrained problem, and $r_{1:t}$ is
differentiable, we must have $\nabla r_{1:t}(u_{t+1}) + g_{1:t} = 0$. Hence, we conclude
$Q_{1:t} u_{t+1} - \sum_{\tau=1}^{t} Q_\tau x_\tau + g_{1:t} = 0$, or

\[ u_{t+1} = Q_{1:t}^{-1} \left( \sum_{\tau=1}^{t} Q_\tau x_\tau - g_{1:t} \right). \tag{9} \]



This closed-form solution will let us bound the difference between $u_t$ and $u_{t+1}$ in terms of
$g_t$. The next lemma relates this distance to the difference between $x_t$ and $x_{t+1}$, which
determines our per-round regret (Equation (6)). In particular, we show that the projection operator
only makes $u_t$ and $u_{t+1}$ closer together, in terms of distance as measured by the norm
$\|A_t \cdot\|$. We defer the proof to the end of the section.

Lemma 3. Let $Q \in S^n_{++}$ with $A = Q^{1/2}$. Let $\mathcal{F}$ be a convex set, and let
$u_1, u_2 \in \mathbb{R}^n$, with $x_1 = P_{\mathcal{F},A}(u_1)$ and $x_2 = P_{\mathcal{F},A}(u_2)$.
Then,

\[ \|A(x_2 - x_1)\| \le \|A(u_1 - u_2)\|. \]
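The non-expansiveness stated in Lemma 3 can be spot-checked on random inputs. This sketch again restricts to a diagonal $Q$ and a hyperrectangle (assumptions of the example, under which the projection is coordinate-wise clipping); the box, the sampling ranges and the function names are all hypothetical choices.

```python
import math
import random

def weighted_norm(v, q):
    # ||A v|| with A = diag(sqrt(q)), i.e. sqrt(sum_i q_i v_i^2).
    return math.sqrt(sum(qi * vi * vi for qi, vi in zip(q, v)))

def clip_box(u, lo, hi):
    # Projection onto a box under a diagonal norm is coordinate-wise clipping.
    return [min(max(ui, l), h) for ui, l, h in zip(u, lo, hi)]

def check_nonexpansive(trials=1000, seed=0):
    rng = random.Random(seed)
    lo, hi = [-1.0, 0.0, -2.0], [1.0, 0.5, 2.0]
    for _ in range(trials):
        q = [rng.uniform(0.1, 5.0) for _ in lo]
        u1 = [rng.uniform(-4.0, 4.0) for _ in lo]
        u2 = [rng.uniform(-4.0, 4.0) for _ in lo]
        x1, x2 = clip_box(u1, lo, hi), clip_box(u2, lo, hi)
        dx = [b - a for a, b in zip(x1, x2)]
        du = [b - a for a, b in zip(u1, u2)]
        # Small tolerance absorbs floating-point rounding.
        if weighted_norm(dx, q) > weighted_norm(du, q) + 1e-12:
            return False
    return True
```

A passing check is only a sanity test of the statement in this special case, not a substitute for the proof given at the end of the section.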

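We now prove the following lemma, which will immediately yield the desired bound on
$f_t(x_t) - f_t(x_{t+1})$.

Lemma 4. Let $Q \in S^n_{++}$ with $A = Q^{1/2}$. Let $v, g \in \mathbb{R}^n$, and let
$u_1 = -Q^{-1} v$ and $u_2 = -Q^{-1}(v + g)$. Then, letting $x_1 = P_{\mathcal{F},A}(u_1)$ and
$x_2 = P_{\mathcal{F},A}(u_2)$,

\[ g^\top (x_1 - x_2) \le \|A^{-1} g\|^2. \]

Proof. The norms $\|A \cdot\|$ and $\|A^{-1} \cdot\|$ are dual norms (see, for example,
[Boyd and Vandenberghe, 2004]). Using this fact,

\begin{align*}
g^\top (x_1 - x_2) &\le \|A^{-1} g\| \cdot \|A(x_1 - x_2)\| \\
&\le \|A^{-1} g\| \cdot \|A(u_1 - u_2)\| && \text{(Lemma 3)} \\
&= \|A^{-1} g\| \cdot \|A(Q^{-1} g)\| \\
&= \|A^{-1} g\| \cdot \|A(A^{-1} A^{-1}) g\| && \text{(because $Q^{-1} = (AA)^{-1}$)} \\
&= \|A^{-1} g\| \cdot \|A^{-1} g\|.
\end{align*}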
□

Proof of Theorem 2. First note that because $r_t(x_t) = 0$ and $r_t$ is non-negative,
$x_t = \arg\min_{x \in \mathcal{F}} r_t(x)$. For any functions $f$ and $g$, if
$x^* = \arg\min_{x \in \mathcal{F}} f(x)$ and $x^* = \arg\min_{x \in \mathcal{F}} g(x)$, then

\[ x^* = \arg\min_{x \in \mathcal{F}} \bigl( f(x) + g(x) \bigr). \]

Thus we have

\begin{align*}
x_t &= \arg\min_{x \in \mathcal{F}} \bigl( g_{1:t-1} x + r_{1:t-1}(x) \bigr) \\
&= \arg\min_{x \in \mathcal{F}} \bigl( g_{1:t-1} x + r_{1:t}(x) \bigr) && \text{(because $x_t = \arg\min_{x \in \mathcal{F}} r_t(x)$)} \\
&= \arg\min_{x \in \mathcal{F}} \left( h x + \frac{1}{2} x^\top Q_{1:t} x \right)
\end{align*}

where the last line follows from Equation (8), letting
$h = g_{1:t-1} - q_{1:t} = g_{1:t-1} - \sum_{\tau=1}^{t} Q_\tau x_\tau$, and dropping the constant
$k_{1:t}$. For $x_{t+1}$, we have directly from the definitions

\[ x_{t+1} = \arg\min_{x \in \mathcal{F}} \bigl( g_{1:t} x + r_{1:t}(x) \bigr) = \arg\min_{x \in \mathcal{F}} \left( (h + g_t) x + \frac{1}{2} x^\top Q_{1:t} x \right). \]

Thus, Lemma 2 implies $x_t = P_{\mathcal{F},A_t}\bigl( -(Q_{1:t})^{-1} h \bigr)$ and similarly
$x_{t+1} = P_{\mathcal{F},A_t}\bigl( -(Q_{1:t})^{-1} (h + g_t) \bigr)$. Thus, by Lemma 4,
$g_t (x_t - x_{t+1}) \le \|A_t^{-1} g_t\|^2$. The theorem then follows from Lemma 1. □

Proof of Lemma 3. Define

\[ B(x, u) = \frac{1}{2} \|A(x - u)\|^2 = \frac{1}{2} (x - u)^\top Q (x - u), \]

so we can write equivalently

\[ x_1 = \arg\min_{x \in \mathcal{F}} B(x, u_1). \]

Then, note that $\nabla_x B(x, u_1) = Qx - Qu_1$, and so we must have
$(Qx_1 - Qu_1)^\top (x_2 - x_1) \ge 0$; otherwise for $\delta$ sufficiently small the point
$x_1 + \delta(x_2 - x_1)$ would belong to $\mathcal{F}$ (by convexity) and would be closer to $u_1$
than $x_1$ is. Similarly, we must have $(Qx_2 - Qu_2)^\top (x_1 - x_2) \ge 0$. Combining these, we
have the following equivalent inequalities:

\begin{align*}
(Qx_1 - Qu_1)^\top (x_2 - x_1) - (Qx_2 - Qu_2)^\top (x_2 - x_1) &\ge 0 \\
(x_1 - u_1)^\top Q (x_2 - x_1) - (x_2 - u_2)^\top Q (x_2 - x_1) &\ge 0 \\
-(x_2 - x_1)^\top Q (x_2 - x_1) + (u_2 - u_1)^\top Q (x_2 - x_1) &\ge 0 \\
(u_2 - u_1)^\top Q (x_2 - x_1) &\ge (x_2 - x_1)^\top Q (x_2 - x_1).
\end{align*}

Letting $\hat{u} = u_2 - u_1$ and $\hat{x} = x_2 - x_1$, we have
$\hat{x}^\top Q \hat{x} \le \hat{u}^\top Q \hat{x}$. Since $Q$ is positive semidefinite, we have
$(\hat{u} - \hat{x})^\top Q (\hat{u} - \hat{x}) \ge 0$, or equivalently
$\hat{u}^\top Q \hat{u} + \hat{x}^\top Q \hat{x} - 2 \hat{x}^\top Q \hat{u} \ge 0$ (using the fact
$Q$ is also symmetric). Thus,

\[ \hat{u}^\top Q \hat{u} \ge -\hat{x}^\top Q \hat{x} + 2 \hat{x}^\top Q \hat{u} \ge -\hat{x}^\top Q \hat{x} + 2 \hat{x}^\top Q \hat{x} = \hat{x}^\top Q \hat{x}, \]

and so

\[ \|A(u_2 - u_1)\|^2 = \hat{u}^\top Q \hat{u} \ge \hat{x}^\top Q \hat{x} = \|A(x_2 - x_1)\|^2. \]

□



3 Specific Adaptive Algorithms and Competitive Ratios 



Before proceeding to the specific results, we establish several results that will be useful in the
subsequent arguments. In order to prove that adaptive schemes for selecting $Q_t$ have good
competitive ratios for the bound optimization problem, we will need to compare the bounds obtained
by the adaptive scheme to the optimal post-hoc bound of Equation (4). Suppose the sequence
$Q_1, \dots, Q_T$ is optimal for Equation (4), and consider the alternative sequence
$Q'_1 = Q_{1:T}$ and $Q'_t = 0$ for $t > 1$. Using the fact that $Q_{1:t} \succeq Q_{1:t-1}$ implies
$Q_{1:t}^{-1} \preceq Q_{1:t-1}^{-1}$, it is easy to show the alternative sequence also achieves the
minimum. It follows that a sequence with $Q_1 = Q$ on the first round, and $Q_t = 0$ thereafter is
always optimal. Hence, to solve for the post-hoc bound we can solve an optimization of the form

\[ \inf_{Q \in \mathcal{Q}} \left( \max_{\hat{y} \in \mathcal{F}_{\text{sym}}} \left( \frac{1}{2} \hat{y}^\top Q \hat{y} \right) + \sum_{t=1}^{T} g_t^\top Q^{-1} g_t \right). \tag{10} \]

The diameter of $\mathcal{F}$ is $D = \max_{y, y' \in \mathcal{F}} \|y - y'\|_2$, and so for
$\hat{y} \in \mathcal{F}_{\text{sym}}$, $\|\hat{y}\|_2 \le D$. When $\mathcal{F}$ is symmetric
($x \in \mathcal{F}$ implies $-x \in \mathcal{F}$), we have $y \in \mathcal{F}$ if and only if
$2y \in \mathcal{F}_{\text{sym}}$, so (10) is equivalent to:

\[ \inf_{Q \in \mathcal{Q}} \left( \max_{y \in \mathcal{F}} \bigl( 2 y^\top Q y \bigr) + \sum_{t=1}^{T} g_t^\top Q^{-1} g_t \right). \tag{11} \]

For simplicity of exposition, we assume $g_{1,i} > 0$ for all $i$, which ensures that only positive
definite matrices can be optimal.² This assumption also ensures $Q_1 \in S^n_{++}$ for the adaptive
schemes discussed below, as required by Theorem 2. This is without loss of generality, as we can
always hallucinate an initial loss function with arbitrarily small components, and this changes
regret by an arbitrarily small amount. We will also use the following lemma from
Auer and Gentile [2000]. For completeness, a proof is included in Appendix B.

Lemma 5. For any non-negative real numbers $x_1, x_2, \dots, x_n$,

\[ \sum_{i=1}^{n} \frac{x_i}{\sqrt{\sum_{j=1}^{i} x_j}} \le 2 \sqrt{\sum_{i=1}^{n} x_i}. \]
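The inequality of Lemma 5 is easy to evaluate numerically. The following sketch computes both sides for a given sequence; the function name is hypothetical, and terms whose prefix sum is zero (a $0/0$ case the lemma statement does not spell out) are treated as contributing zero, which is an assumption of this example.

```python
import math

def lemma5_sides(xs):
    """Return (lhs, rhs) of Lemma 5 for non-negative numbers xs."""
    lhs = 0.0
    running = 0.0
    for x in xs:
        running += x
        if running > 0.0:
            # x_i divided by the square root of the prefix sum x_1 + ... + x_i.
            lhs += x / math.sqrt(running)
    rhs = 2.0 * math.sqrt(running)
    return lhs, rhs
```

For the constant sequence $(1, 1, 1, 1)$ the left side is $1 + 1/\sqrt{2} + 1/\sqrt{3} + 1/2$, comfortably below the right side of 4.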



3.1 Adaptive coordinate-constant regularization 

We derive bounds where $Q_t$ is chosen from the set $\mathcal{Q}_{\text{const}}$, and show that this
algorithm comes within a factor of $\sqrt{2}$ of using the best constant regularization strength
$\lambda I$. This algorithm achieves a bound of $O(DM\sqrt{T})$ where $D$ is the diameter of the
feasible region and $M$ is a bound on $\|g_t\|_2$, matching the best possible bounds in terms of
these parameters [Abernethy et al., 2008]. We will prove a much stronger competitive guarantee for
this algorithm in Theorem 6.
Corollary 1. Suppose $\mathcal{F}$ has $L_2$ diameter $D$. Then, if we run FTPRL with diagonal
matrices such that

\[ (Q_{1:t})_{ii} = \bar{\alpha}_t = \frac{2 \sqrt{G_t}}{D} \]

where $G_t = \sum_{\tau=1}^{t} \sum_{i=1}^{n} g_{\tau,i}^2$, then

\[ \text{Regret} \le 2 D \sqrt{G_T}. \]

If $\|g_t\|_2 \le M$, then $G_T \le M^2 T$, and this translates to a bound of $O(DM\sqrt{T})$. When
$\mathcal{F} = \{x \mid \|x\|_2 \le D/2\}$, this bound is $\sqrt{2}$-competitive for the bound
optimization problem over $\mathcal{Q}_{\text{const}}$.

² In the case where $\mathcal{F}$ has 0 width in some direction, the infimum will not be attained by
a finite $Q$, but by a sequence that assigns 0 penalty (on the right-hand side) to the components of
the gradient in the direction of 0 width, requiring some entries in $Q$ to go to $\infty$.

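Proof. Let the diagonal entries of $Q_t$ all be $\alpha_t = \bar{\alpha}_t - \bar{\alpha}_{t-1}$
(with $\bar{\alpha}_0 = 0$), so $\alpha_{1:t} = \bar{\alpha}_t$. Note $\alpha_t \ge 0$, and so this
choice is feasible. We consider the left and right-hand terms of Equation (3) separately. For the
left-hand term, letting $\hat{y}_t$ be an arbitrary sequence of points from
$\mathcal{F}_{\text{sym}}$, and noting
$\hat{y}_t^\top \hat{y}_t \le \|\hat{y}_t\|_2 \cdot \|\hat{y}_t\|_2 \le D^2$,

\[ \frac{1}{2} \sum_{t=1}^{T} \hat{y}_t^\top Q_t \hat{y}_t = \frac{1}{2} \sum_{t=1}^{T} \hat{y}_t^\top \hat{y}_t \, \alpha_t \le \frac{1}{2} D^2 \sum_{t=1}^{T} \alpha_t = \frac{1}{2} D^2 \bar{\alpha}_T = D \sqrt{G_T}. \]

For the right-hand term, we have

\[ \sum_{t=1}^{T} g_t^\top Q_{1:t}^{-1} g_t = \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{g_{t,i}^2}{\bar{\alpha}_t} = \frac{D}{2} \sum_{t=1}^{T} \frac{\sum_{i=1}^{n} g_{t,i}^2}{\sqrt{G_t}} \le D \sqrt{G_T}, \]

where the last inequality follows from Lemma 5.

In order to make a competitive guarantee, we must prove a lower bound on the post-hoc optimal bound
function $B_R$, Equation (10). This is in contrast to the upper bound that we must show for the
regret of the algorithm. When $\mathcal{F} = \{x \mid \|x\|_2 \le D/2\}$, Equation (10) simplifies
to exactly

\[ \min_{\alpha \ge 0} \left( \frac{1}{2} \alpha D^2 + \frac{1}{\alpha} G_T \right) = D \sqrt{2 G_T} \tag{12} \]

and so we conclude the adaptive algorithm is $\sqrt{2}$-competitive for the bound optimization
problem. □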
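The arithmetic behind the $\sqrt{2}$ ratio can be verified directly. This sketch evaluates the post-hoc objective $\frac{1}{2} \alpha D^2 + G_T / \alpha$ from Equation (12), its closed-form minimizer $\alpha = \sqrt{2 G_T} / D$ (obtained by setting the derivative to zero), and the adaptive bound $2 D \sqrt{G_T}$ from Corollary 1. The function names are hypothetical.

```python
import math

def posthoc_objective(alpha, D, G):
    # (1/2) * alpha * D^2 + G / alpha, the quantity minimized in Equation (12).
    return 0.5 * alpha * D * D + G / alpha

def posthoc_optimum(D, G):
    # Setting the derivative D^2/2 - G/alpha^2 to zero gives alpha = sqrt(2G)/D.
    alpha = math.sqrt(2.0 * G) / D
    return alpha, D * math.sqrt(2.0 * G)

def adaptive_bound(D, G):
    # Regret bound of the adaptive coordinate-constant scheme (Corollary 1).
    return 2.0 * D * math.sqrt(G)
```

With $D = 2$ and $G_T = 8$, the minimizer is $\alpha = 2$, the post-hoc bound is 8, and the adaptive bound is $4\sqrt{8} = 8\sqrt{2}$.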



3.2 Adaptive diagonal regularization 

In this section, we introduce and analyze FTPRL-Diag, a specialization of FTPRL that uses
regularization matrices from $\mathcal{Q}_{\text{diag}}$. Let
$D_i = \max_{x, x' \in \mathcal{F}} |x_i - x'_i|$, the width of $\mathcal{F}$ along the $i$th
coordinate. We construct a bound on the regret of FTPRL-Diag in terms of these $D_i$. The $D_i$
implicitly define a hyperrectangle that contains $\mathcal{F}$. When $\mathcal{F}$ is in fact such a
hyperrectangle, our bound is $\sqrt{2}$-competitive with the best post-hoc optimal bound using
matrices from $\mathcal{Q}_{\text{diag}}$.

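Corollary 2. Let $\mathcal{F}$ be a convex feasible set of width $D_i$ in coordinate $i$. We can
construct diagonal matrices $Q_t$ such that the $i$th entry on the diagonal of $Q_{1:t}$ is given
by:

\[ \bar{\lambda}_{t,i} = \frac{2}{D_i} \sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}. \]

Then the regret of FTPRL satisfies

\[ \text{Regret} \le 2 \sum_{i=1}^{n} D_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}. \]

When $\mathcal{F}$ is a hyperrectangle, then this algorithm is $\sqrt{2}$-competitive with the
post-hoc optimal choice of $Q_t$ from $\mathcal{Q}_{\text{diag}}$. That is,

\[ \text{Regret} \le \sqrt{2} \inf_{Q \in \mathcal{Q}_{\text{diag}}} \left( \max_{\hat{y} \in \mathcal{F}_{\text{sym}}} \left( \frac{1}{2} \hat{y}^\top Q \hat{y} \right) + \sum_{t=1}^{T} g_t^\top Q^{-1} g_t \right). \]

Algorithm 1 FTPRL-Diag

  Input: feasible set $\mathcal{F} \subseteq \times_{i=1}^{n} [a_i, b_i]$
  Initialize $x_1 = 0 \in \mathcal{F}$
  $(\forall i)$, $G_i = 0$, $q_i = 0$, $\lambda_{0,i} = 0$, $D_i = b_i - a_i$
  for $t = 1$ to $T$ do
    Play the point $x_t$, incur loss $f_t(x_t)$
    Let $g_t \in \partial f_t(x_t)$
    for $i = 1$ to $n$ do
      $G_i = G_i + g_{t,i}^2$
      $\lambda_{t,i} = \frac{2}{D_i} \sqrt{G_i} - \lambda_{1:t-1,i}$
      $q_i = q_i + x_{t,i} \lambda_{t,i}$
      $u_{t+1,i} = (q_i - g_{1:t,i}) / \lambda_{1:t,i}$
    end for
    $A_t = \text{diag}(\sqrt{\lambda_{1:t,1}}, \dots, \sqrt{\lambda_{1:t,n}})$
    $x_{t+1} = \text{Project}_{\mathcal{F},A_t}(u_{t+1})$
  end for

Algorithm 2 FTPRL-Scale

  Input: feasible set $\mathcal{F} \subseteq \{x \mid \|Ax\| \le 1\}$, with $A \in S^n_{++}$
  Let $\hat{\mathcal{F}} = \{x \mid \|x\| \le 1\}$
  Initialize $x_1 = 0$, $(\forall i)$ $D_i = b_i - a_i$
  for $t = 1$ to $T$ do
    Play the point $x_t$, incur loss $f_t(x_t)$
    Let $g_t \in \partial f_t(x_t)$
    $\hat{g}_t = (A^{-1})^\top g_t$
    $\bar{\alpha} = \sqrt{\sum_{\tau=1}^{t} \sum_{i=1}^{n} \hat{g}_{\tau,i}^2}$
    $\alpha_t = \bar{\alpha} - \alpha_{1:t-1}$
    $q_t = \alpha_t \hat{x}_t$
    $\hat{u}_{t+1} = (1/\bar{\alpha})(q_{1:t} - \hat{g}_{1:t})$
    $A_t = (\bar{\alpha} I)^{1/2}$
    $\hat{x}_{t+1} = \text{Project}_{\hat{\mathcal{F}},A_t}(\hat{u}_{t+1})$
    $x_{t+1} = A^{-1} \hat{x}_{t+1}$
  end for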
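A minimal Python sketch of FTPRL-Diag follows, under assumptions that go beyond the text: the feasible set is exactly a hyperrectangle containing the origin, the losses are linear and given up front as a list of gradient vectors, and a coordinate that has seen no non-zero gradient simply keeps its current value. The unconstrained step uses the closed form of Equation (9) specialized to diagonal matrices, and the projection under a diagonal $A_t$ onto a box is coordinate-wise clipping. The names `ftprl_diag` and `diag_regret_bound` are hypothetical.

```python
import math

def ftprl_diag(grads, lo, hi):
    """Run FTPRL-Diag on the box prod_i [lo[i], hi[i]] against fixed gradients.
    Returns the played points x_1, ..., x_T."""
    n = len(lo)
    D = [h - l for l, h in zip(lo, hi)]   # coordinate widths D_i
    x = [0.0] * n                         # x_1 = 0, assumed feasible
    G = [0.0] * n                         # running sums of squared gradients
    q = [0.0] * n                         # running sums of lambda_{t,i} * x_{t,i}
    lam_sum = [0.0] * n                   # lambda_{1:t,i}
    g_sum = [0.0] * n                     # g_{1:t,i}
    played = []
    for g in grads:
        played.append(list(x))
        nxt = list(x)
        for i in range(n):
            G[i] += g[i] ** 2
            g_sum[i] += g[i]
            lam = (2.0 / D[i]) * math.sqrt(G[i]) - lam_sum[i]
            q[i] += lam * x[i]
            lam_sum[i] += lam
            if lam_sum[i] > 0.0:
                # Unconstrained minimizer, then projection (clipping) onto the box.
                u = (q[i] - g_sum[i]) / lam_sum[i]
                nxt[i] = min(max(u, lo[i]), hi[i])
        x = nxt
    return played

def diag_regret_bound(grads, lo, hi):
    # 2 * sum_i D_i * sqrt(sum_t g_{t,i}^2), the bound of Corollary 2.
    n = len(lo)
    return 2.0 * sum(
        (hi[i] - lo[i]) * math.sqrt(sum(g[i] ** 2 for g in grads)) for i in range(n)
    )
```

On the one-dimensional box $[-1, 1]$ with two gradients equal to 1, the sketch plays $0$ and then $-1$; the regret against the best fixed point is 1, well within the bound $4\sqrt{2}$.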

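Proof. The construction of $Q_{1:t}$ in the theorem statement implies
$(Q_t)_{ii} = \lambda_{t,i} \equiv \bar{\lambda}_{t,i} - \bar{\lambda}_{t-1,i}$. These entries are
guaranteed to be non-negative as $\bar{\lambda}_{t,i}$ is a non-decreasing function of $t$.

We begin from Equation (3), letting $\hat{y}_t$ be an arbitrary sequence of points from
$\mathcal{F}_{\text{sym}}$. For the left-hand term,

\[ \frac{1}{2} \sum_{t=1}^{T} \hat{y}_t^\top Q_t \hat{y}_t = \frac{1}{2} \sum_{t=1}^{T} \sum_{i=1}^{n} \hat{y}_{t,i}^2 \lambda_{t,i} \le \frac{1}{2} \sum_{i=1}^{n} D_i^2 \sum_{t=1}^{T} \lambda_{t,i} = \frac{1}{2} \sum_{i=1}^{n} D_i^2 \bar{\lambda}_{T,i} = \sum_{i=1}^{n} D_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}. \]

For the right-hand term, we have

\[ \sum_{t=1}^{T} g_t^\top Q_{1:t}^{-1} g_t = \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{g_{t,i}^2}{\bar{\lambda}_{t,i}} = \sum_{i=1}^{n} \frac{D_i}{2} \sum_{t=1}^{T} \frac{g_{t,i}^2}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \le \sum_{i=1}^{n} D_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}, \]

where the last inequality follows from Lemma 5. Summing these bounds on the two terms of
Equation (3) yields the stated bound on regret.

Now, we consider the case where the feasible set is exactly a hyperrectangle, that is,
$\mathcal{F} = \{x \mid x_i \in [a_i, b_i]\}$ where $D_i = b_i - a_i$. Then, the optimization of
Equation (10) decomposes on a per-coordinate basis, and in particular there exists a
$\hat{y} \in \mathcal{F}_{\text{sym}}$ so that $\hat{y}_i^2 = D_i^2$ in each coordinate. Thus, for
$Q = \text{diag}(\lambda_1, \dots, \lambda_n)$, the bound function is exactly

\[ \frac{1}{2} \sum_{i=1}^{n} \lambda_i D_i^2 + \sum_{i=1}^{n} \frac{1}{\lambda_i} \sum_{t=1}^{T} g_{t,i}^2. \]

Choosing $\lambda_i = \frac{1}{D_i} \sqrt{2 \sum_{t=1}^{T} g_{t,i}^2}$ minimizes this quantity,
producing a post-hoc bound of

\[ \sqrt{2} \sum_{i=1}^{n} D_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}, \]

verifying that the adaptive scheme is $\sqrt{2}$-competitive with matrices from
$\mathcal{Q}_{\text{diag}}$. □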



The regret guarantees of the FTPRL-Diag algorithm hold on arbitrary feasible sets, but the compet- 
itive guarantee only applies for hyperrectangles. We now extend this result, showing that a competitive 
guarantee can be made based on how well the feasible set is approximated by hyperrectangles. 

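Theorem 3. Let $\mathcal{F}$ be an arbitrary feasible set, bounded by a hyperrectangle
$H^{\text{out}}$ of width $W_i$ in coordinate $i$; further, let $H^{\text{in}}$ be a hyperrectangle
contained by $\mathcal{F}$, of width $w_i > 0$ in coordinate $i$. That is,
$H^{\text{in}} \subseteq \mathcal{F} \subseteq H^{\text{out}}$. Let $\beta = \max_i \frac{W_i}{w_i}$.
Then, FTPRL-Diag is $\sqrt{2}\beta$-competitive with $\mathcal{Q}_{\text{diag}}$ on $\mathcal{F}$.

Proof. By Corollary 2, the adaptive algorithm achieves regret bounded by
$2 \sum_{i=1}^{n} W_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}$. We now consider the best post-hoc bound
achievable with diagonal matrices on $\mathcal{F}$. Considering Equation (10), it is clear that for
any $Q$,

\[ \max_{\hat{y} \in \mathcal{F}_{\text{sym}}} \frac{1}{2} \hat{y}^\top Q \hat{y} + \sum_{t=1}^{T} g_t^\top Q^{-1} g_t \ge \max_{\hat{y} \in H^{\text{in}}_{\text{sym}}} \frac{1}{2} \hat{y}^\top Q \hat{y} + \sum_{t=1}^{T} g_t^\top Q^{-1} g_t, \]

since the feasible set for the maximization ($\mathcal{F}_{\text{sym}}$) is larger on the left-hand
side. But, on the right-hand side we have the post-hoc bound for diagonal regularization on a
hyperrectangle, which we computed in the previous section to be
$\sqrt{2} \sum_{i=1}^{n} w_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}$. Because $w_i \ge \frac{W_i}{\beta}$ by
assumption, this is lower bounded by
$\frac{\sqrt{2}}{\beta} \sum_{i=1}^{n} W_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}$, which proves the
theorem. □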

Having had success with $L_\infty$, we now consider the potential benefits of diagonal adaptation
for other $L_p$-balls.



3.3 A post-hoc bound for diagonal regularization on L p balls 

Suppose the feasible set $\mathcal{F}$ is a unit $L_p$-ball, that is
$\mathcal{F} = \{x \mid \|x\|_p \le 1\}$. We consider the post-hoc bound optimization problem of
Equation (11) with $\mathcal{Q} = \mathcal{Q}_{\text{diag}}$. Our results are summarized in the
following theorem.

Theorem 4. For p > 2, the optimal regularization matrix for Br in Qdi ag is not coordinate-constant 
(i.e., not contained in Q const ), except in the degenerate case where Gi = X)t=i 9ti ' s ^ e same f or aR i- 
However, for p < 2, the optimal regularization matrix in Qdiag always belongs to Q CO nst- 
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Proof. Since JF is symmetric, the optimal post-hoc choice will be in the form of Equation dTTb . Letting 
Q = diag(Ai , . . . , A„), we can re-write this optimization problem as 



max 



To determine the optimal A vector, we first derive a closed form for the solution to the maximization 
problem on the left hand side, assuming p > 2 (we handle the case p < 2 separately below). First note 

that the inequality \\y\\ p < 1 is equivalent to Y^7—i \Ui\ P — 1- Making the change of variable Zi = yf, 

p 

this is equivalent to Y^i=i z i — 1' which is equivalent to |j z\\ e < 1 (the assumption p > 2 ensures that 
|| ■ || p is a norm). Hence, the left-hand side optimization reduces to 



max 2 V" z { \ t = 2||A|| 9 , 



where q = so that || ■ || z and || • || g are dual norms (allowing q = oo for p = 2). Thus, for p > 2, 
the above bound simplifies to 

n „ 

B(A) = 2||A|| g + ^-i. (14) 

i=i 1 

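First suppose $p > 2$, so that $q$ is finite. Then, taking the gradient of $B(\lambda)$,

$$\nabla B(\lambda)_i = \frac{2}{q} \Big( \sum_{j=1}^n \lambda_j^q \Big)^{\frac{1}{q} - 1} q \lambda_i^{q-1} - \frac{G_i^2}{\lambda_i^2} = 2 \Big( \frac{\lambda_i}{\|\lambda\|_q} \Big)^{q-1} - \frac{G_i^2}{\lambda_i^2},$$

using $\frac{1}{q} - 1 = -\frac{1}{q}(q - 1)$. If we make all the $\lambda_i$'s equal (say, to $\lambda_1$), then for the left-hand side we get

$$\Big( \frac{\lambda_1}{\|\lambda\|_q} \Big)^{q-1} = \Big( \frac{\lambda_1}{(n \lambda_1^q)^{1/q}} \Big)^{q-1} = \Big( \frac{1}{n^{1/q}} \Big)^{q-1} = n^{\frac{1}{q} - 1}.$$

Thus the $i$th component of the gradient is $2 n^{\frac{1}{q} - 1} - G_i^2 / \lambda_1^2$, and so if not all the $G_i$'s are equal, some
component of the gradient is non-zero. Because $B(\lambda)$ is differentiable and the $\lambda_i \ge 0$ constraints cannot
be tight (recall $g_1 > 0$), this implies a constant $\lambda_i$ cannot be optimal, hence the optimal regularization
matrix is not in $\mathcal{Q}_{\mathrm{const}}$.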

For $p \in [1, 2]$, we show that the solution to Equation (13) is

$$B_\infty(\lambda) = 2 \|\lambda\|_\infty + \sum_{i=1}^n \frac{G_i^2}{\lambda_i}. \tag{15}$$

For $p = 2$ this follows immediately from Equation (14), because when $p = 2$ we have $q = \infty$. For
$p \in [1, 2)$, the solution to Equation (13) is at least $B_\infty(\lambda)$, because we can always set $y_i = 1$ for
whatever $\lambda_i$ is largest and set $y_j = 0$ for $j \ne i$. If $p < 2$ then the feasible set $\mathcal{F}$ is a subset of the unit
$L_2$ ball, so the solution to Equation (13) is upper bounded by the solution when $p = 2$, namely $B_\infty(\lambda)$.
It follows that the solution is exactly $B_\infty(\lambda)$. Because the left-hand term of $B_\infty(\lambda)$ only penalizes for
the largest $\lambda_i$, and on the right-hand side we would like all $\lambda_i$ as large as possible, a solution of the form
$\lambda_1 = \lambda_2 = \dots = \lambda_n$ must be optimal. □
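The $p > 2$ part of the argument can be checked numerically. The sketch below is illustrative only: the helper names (`lq_norm`, `post_hoc_bound`, `post_hoc_gradient`, `dual_maximizer`) are hypothetical and not part of the paper. It evaluates $B(\lambda)$ from Equation (14), its analytic gradient, and the vector $z$ that attains the dual-norm maximum, so one can see that a constant $\lambda$ leaves a non-zero gradient whenever the $G_i$ differ.

```python
import math

def lq_norm(v, q):
    """L_q norm of a vector with non-negative entries (q may be math.inf)."""
    if math.isinf(q):
        return max(v)
    return sum(x ** q for x in v) ** (1.0 / q)

def post_hoc_bound(lam, G, p):
    """B(lambda) = 2 * ||lambda||_q + sum_i G_i^2 / lambda_i, for p >= 2."""
    q = math.inf if p == 2 else p / (p - 2.0)
    return 2.0 * lq_norm(lam, q) + sum(g * g / l for g, l in zip(G, lam))

def post_hoc_gradient(lam, G, p):
    """Analytic gradient of B for p > 2 (so that q is finite)."""
    q = p / (p - 2.0)
    norm = lq_norm(lam, q)
    return [2.0 * (l / norm) ** (q - 1.0) - (g / l) ** 2 for g, l in zip(G, lam)]

def dual_maximizer(lam, p):
    """A z with ||z||_{p/2} = 1 that attains sum_i z_i * lambda_i = ||lambda||_q."""
    q = p / (p - 2.0)
    norm = lq_norm(lam, q)
    return [(l / norm) ** (q - 1.0) for l in lam]

if __name__ == "__main__":
    G = [1.0, 2.0, 3.0]          # assumed per-coordinate gradient magnitudes
    lam = [1.5, 1.5, 1.5]        # a coordinate-constant candidate
    # With unequal G_i, the components differ, so they cannot all vanish.
    print(post_hoc_gradient(lam, G, p=4.0))
```

For a constant $\lambda$ the first gradient term is the same in every coordinate, so the components can only vanish together when all $G_i$ coincide, which is the degenerate case named in Theorem 4.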
3.4 Full matrix regularization on hyperspheres and hyperellipsoids

In this section, we develop an algorithm for feasible sets $\mathcal{F} \subseteq \{x \mid \|Ax\|_p \le 1\}$, where $p \in [1, 2]$
and $A \in S^n_{++}$. When $\mathcal{F} = \{x \mid \|Ax\|_2 \le 1\}$, this algorithm, FTPRL-Scale, is $\sqrt{2}$-competitive with
arbitrary $Q \in S^n_+$. For $\mathcal{F} = \{x \mid \|Ax\|_p \le 1\}$ with $p \in [1, 2)$ it is $\sqrt{2}$-competitive with $\mathcal{Q}_{\mathrm{diag}}$.

First, we show that rather than designing adaptive schemes specifically for linear transformations of 
norm balls, it is sufficient (from the point of view of analyzing FTPRL) to consider unit norm balls if 
suitable pre-processing is applied. In the same fashion that pre-conditioning may speed batch subgradi- 
ent descent algorithms, we show this approach can produce significantly improved regret bounds when 
A is poorly conditioned (i.e., the ratio of the largest to smallest eigenvalue is large). 

Theorem 5. Fix an arbitrary norm $\|\cdot\|$, and define an online linear optimization problem $\mathcal{I} =
(\mathcal{F}, (g_1, \dots, g_T))$ where $\mathcal{F} = \{x \mid \|Ax\| \le 1\}$ with $A \in S^n_{++}$. We define the related instance $\hat{\mathcal{I}} =
(\hat{\mathcal{F}}, (\hat{g}_1, \dots, \hat{g}_T))$, where $\hat{\mathcal{F}} = \{\hat{x} \mid \|\hat{x}\| \le 1\}$ and $\hat{g}_t = A^{-1} g_t$. Then:

• If we run any algorithm dependent only on subgradients on $\hat{\mathcal{I}}$, and it plays $\hat{x}_1, \dots, \hat{x}_T$, then by
playing the corresponding points $x_t = A^{-1} \hat{x}_t$ on $\mathcal{I}$ we achieve identical loss and regret.

• The post-hoc optimal bound over arbitrary $Q \in S^n_+$ is identical for these two instances.

Proof. First, we note that for any function $h$ where $\min_{x : \|Ax\| \le 1} h(x)$ exists,

$$\min_{x : \|Ax\| \le 1} h(x) = \min_{\hat{x} : \|\hat{x}\| \le 1} h(A^{-1} \hat{x}), \tag{16}$$

using the change of variable $\hat{x} = Ax$. For the first claim, note that $\hat{g}_t^\top = g_t^\top A^{-1}$ (since $A$ is symmetric), and so for all
$t$, $\hat{g}_t^\top \hat{x}_t = g_t^\top A^{-1} A x_t = g_t^\top x_t$, implying the losses suffered on $\hat{\mathcal{I}}$ and $\mathcal{I}$ are identical. Applying
Equation (16), we have

$$\min_{x : \|Ax\| \le 1} g_{1:T}^\top x = \min_{\hat{x} : \|\hat{x}\| \le 1} g_{1:T}^\top A^{-1} \hat{x} = \min_{\hat{x} : \|\hat{x}\| \le 1} \hat{g}_{1:T}^\top \hat{x},$$

implying the post-hoc optimal feasible points for the two instances also incur identical loss. Combining
these two facts proves the first claim. For the second claim, it is sufficient to show that for any $Q \in S^n_+$
applied to the post-hoc bound for problem $\mathcal{I}$, there exists a $\hat{Q} \in S^n_+$ that achieves the same bound for
$\hat{\mathcal{I}}$ (and vice versa). Consider such a $Q$ for $\mathcal{I}$. Then, again applying Equation (16), we have

$$\max_{y : \|Ay\| \le 1} (2 y^\top Q y) + \sum_{t=1}^T g_t^\top Q^{-1} g_t = \max_{\hat{y} : \|\hat{y}\| \le 1} (2 \hat{y}^\top A^{-1} Q A^{-1} \hat{y}) + \sum_{t=1}^T \hat{g}_t^\top A Q^{-1} A \hat{g}_t.$$

The left-hand side is the value of the post-hoc bound on $\mathcal{I}$ from Equation (11). Noting that $(A^{-1} Q A^{-1})^{-1} =
A Q^{-1} A$, the right-hand side is the value of the post-hoc bound for $\hat{\mathcal{I}}$ using $\hat{Q} = A^{-1} Q A^{-1}$. The fact that
$A^{-1}$ and $Q$ are in $S^n_{++}$ guarantees $\hat{Q} \in S^n_+$ as well, and the theorem follows. □
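The identities used in this proof are easy to verify numerically on a small instance. The following sketch restricts itself to $2 \times 2$ matrices so that it needs only the standard library; all function names are hypothetical. It builds $\hat{g}_t = A^{-1} g_t$ and $\hat{Q} = A^{-1} Q A^{-1}$ and reports the largest discrepancy between corresponding terms on the two instances.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def inv2(M):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def max_discrepancy(A, Q, gradients, x_hats):
    """Largest gap between matching quantities on the original and transformed instance."""
    A_inv = inv2(A)
    g_hats = [matvec(A_inv, g) for g in gradients]        # g_hat_t = A^{-1} g_t
    Q_hat = matmul(matmul(A_inv, Q), A_inv)               # Q_hat = A^{-1} Q A^{-1}
    Q_inv, Q_hat_inv = inv2(Q), inv2(Q_hat)
    # Gradient terms of the post-hoc bound.
    s = sum(dot(g, matvec(Q_inv, g)) for g in gradients)
    s_hat = sum(dot(gh, matvec(Q_hat_inv, gh)) for gh in g_hats)
    worst = abs(s - s_hat)
    for x_hat in x_hats:
        x = matvec(A_inv, x_hat)                          # x = A^{-1} x_hat
        # Quadratic terms agree point by point, hence so do their maxima.
        worst = max(worst, abs(dot(x, matvec(Q, x)) - dot(x_hat, matvec(Q_hat, x_hat))))
        # Linear losses agree as well.
        for g, gh in zip(gradients, g_hats):
            worst = max(worst, abs(dot(g, x) - dot(gh, x_hat)))
    return worst

if __name__ == "__main__":
    A = [[2.0, 0.5], [0.5, 1.0]]      # symmetric positive definite (example values)
    Q = [[3.0, 1.0], [1.0, 2.0]]      # symmetric positive definite (example values)
    gs = [[1.0, -2.0], [0.5, 0.25], [-1.0, 3.0]]
    xs = [[0.6, 0.8], [-1.0, 0.0], [0.3, -0.4]]
    print(max_discrepancy(A, Q, gs, xs))   # expected to be at floating-point noise level
```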

We can now define the adaptive algorithm FTPRL-Scale: given an $\mathcal{F} \subseteq \{x \mid \|Ax\|_p \le 1\}$, it uses
the transformation suggested by Theorem 5, applying the coordinate-constant algorithm of Corollary 1
to the transformed instance, and playing the corresponding point mapped back into $\mathcal{F}$. Pseudocode is
given as Algorithm 2.
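The wrapping step can be sketched as follows for a diagonal $A = \mathrm{diag}(1/a_1, \dots, 1/a_n)$ and $p = 2$. This is a minimal illustration under stated assumptions, not the paper's Algorithm 2: the base learner `UnitBallGradientDescent` is a plain projected gradient step that stands in for the coordinate-constant algorithm of Corollary 1, and every name below is hypothetical. Only the transformation logic (scale the gradient, run the base learner on the unit ball, map the play back) follows Theorem 5.

```python
import math

class UnitBallGradientDescent:
    """Stand-in base learner on the unit L2 ball (NOT the Corollary 1 algorithm)."""

    def __init__(self, dim):
        self.x = [0.0] * dim
        self.sq_grad_sum = 0.0

    def play(self):
        return list(self.x)

    def update(self, g):
        self.sq_grad_sum += sum(v * v for v in g)
        if self.sq_grad_sum == 0.0:
            return  # no information yet, keep the current point
        step = 1.0 / math.sqrt(self.sq_grad_sum)
        y = [xi - step * gi for xi, gi in zip(self.x, g)]
        norm = math.sqrt(sum(v * v for v in y))
        # Project back onto the unit ball if the step left it.
        self.x = [v / norm for v in y] if norm > 1.0 else y

def run_scaled(axes, gradients, base):
    """Run `base` on the transformed instance and map its plays back.

    axes[i] = a_i, so A = diag(1/a_i) and A^{-1} = diag(a_i).
    Returns the points played on the original instance and the losses g_t . x_t.
    """
    plays, losses = [], []
    for g in gradients:
        x_hat = base.play()
        x = [a * v for a, v in zip(axes, x_hat)]      # x_t = A^{-1} x_hat_t
        plays.append(x)
        losses.append(sum(gi * xi for gi, xi in zip(g, x)))
        g_hat = [a * v for a, v in zip(axes, g)]      # g_hat_t = A^{-1} g_t
        base.update(g_hat)
    return plays, losses

if __name__ == "__main__":
    axes = [1.0, 0.1, 0.01]
    grads = [[1.0, -1.0, 0.5], [-0.5, 2.0, 1.0], [0.25, 0.25, -3.0]]
    plays, losses = run_scaled(axes, grads, UnitBallGradientDescent(3))
    print(plays[-1], sum(losses))
```

Because the wrapper only touches gradients and plays, any base learner exposing `play` and `update` could be substituted without changing `run_scaled`.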



3 By a slightly more cumbersome argument, it is possible to show that instead of applying this transformation, FTPRL can be
run directly on $\mathcal{F}$ using appropriately transformed $Q_t$ matrices.
Theorem 6. The diagonal-constant algorithm analyzed in Corollary 1 is $\sqrt{2}$-competitive with $S^n_+$ when
$\mathcal{F} = \{x \mid \|x\|_p \le 1\}$ for $p = 2$, and $\sqrt{2}$-competitive against $\mathcal{Q}_{\mathrm{diag}}$ when $p \in [1, 2)$. Furthermore, when
$\mathcal{F} = \{x \mid \|Ax\|_p \le 1\}$ with $A \in S^n_{++}$, the FTPRL-Scale algorithm (Algorithm 2) achieves these same
competitive guarantees. In particular, when $\mathcal{F} = \{x \mid \|x\|_2 \le 1\}$, we have

$$\text{Regret} \le \sqrt{2} \inf_{Q \in S^n_+} \Big( \max_{y : \|y\|_2 \le 1} (2 y^\top Q y) + \sum_{t=1}^T g_t^\top Q^{-1} g_t \Big).$$

Proof. The results for $\mathcal{Q}_{\mathrm{diag}}$ with $p \in [1, 2)$ follow from Theorems 4 and 5 and Corollary 1. We now
consider the case $p = 2$. Consider a $Q \in S^n_{++}$ for Equation (11) (recall only a $Q \in S^n_{++}$ could be
optimal since $g_1 > 0$). We can write $Q = P D P^\top$ where $D = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ is a diagonal matrix of
positive eigenvalues and $P P^\top = I$. It is then easy to verify $Q^{-1} = P D^{-1} P^\top$.

When $p = 2$ and $\mathcal{F} = \{x \mid \|x\|_p \le 1\}$, the maximum of $2 y^\top Q y$ over $\mathcal{F}$ is exactly twice the largest
eigenvalue of $Q$, and so the post-hoc bound for $Q$ is

$$2 \max_i(\lambda_i) + \sum_{t=1}^T g_t^\top (P D^{-1} P^\top) g_t.$$

Let $z_t = P^\top g_t$, so each right-hand term is $\sum_{i=1}^n z_{t,i}^2 / \lambda_i$. It is clear this quantity is minimized when
each $\lambda_i$ is chosen as large as possible, while on the left-hand side we are only penalized for the largest
eigenvalue of $Q$ (the largest $\lambda_i$). Thus, a solution where $D = \alpha I$ for $\alpha > 0$ is optimal. Plugging into
the bound, we have

$$B(\alpha) = 2\alpha + \sum_{t=1}^T g_t^\top \Big( P \big( \tfrac{1}{\alpha} I \big) P^\top \Big) g_t = 2\alpha + \frac{1}{\alpha} \sum_{t=1}^T g_t^\top g_t = 2\alpha + \frac{G_T}{\alpha},$$

where we have used the fact that $P P^\top = I$. Setting $\alpha = \sqrt{G_T / 2}$ produces a minimal post-hoc bound
of $2\sqrt{2 G_T}$. The diameter $D$ is 2, so the coordinate-constant algorithm has regret bound $4\sqrt{G_T}$ (Corollary 1),
proving the first claim of the theorem for $p = 2$. The second claim follows from Theorem 5. □
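The last step of this proof is a one-dimensional minimization, which the short sketch below reproduces. The function names are hypothetical; $G_T$ is taken to be the sum of squared gradient norms, as in the final equality for $B(\alpha)$.

```python
import math

def squared_gradient_sum(gradients):
    """G_T = sum_t g_t . g_t for a list of gradient vectors."""
    return sum(v * v for g in gradients for v in g)

def sphere_post_hoc_bound(alpha, G_T):
    """B(alpha) = 2 * alpha + G_T / alpha for Q = alpha * I on the unit L2 ball."""
    return 2.0 * alpha + G_T / alpha

def optimal_alpha(G_T):
    """Minimizer of B(alpha), found by setting dB/dalpha = 2 - G_T / alpha^2 to zero."""
    return math.sqrt(G_T / 2.0)

if __name__ == "__main__":
    grads = [[3.0, 4.0], [1.0, 2.0], [0.0, -1.0]]   # example values
    G_T = squared_gradient_sum(grads)
    a = optimal_alpha(G_T)
    best = sphere_post_hoc_bound(a, G_T)
    # best equals 2 * sqrt(2 * G_T); sqrt(2) times it equals 4 * sqrt(G_T).
    print(G_T, a, best, math.sqrt(2.0) * best)
```

Multiplying the minimal post-hoc bound by $\sqrt{2}$ gives $4\sqrt{G_T}$, which is the quantity an algorithm must match to be $\sqrt{2}$-competitive in this setting.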

Suppose we have a problem instance where $\mathcal{F} = \{x \mid \|Ax\|_2 \le 1\}$ where $A = \mathrm{diag}(1/a_1, \dots, 1/a_n)$
with $a_i > 0$. To demonstrate the advantage offered by this transformation, we can compare the regret
bound obtained by directly applying the algorithm of Corollary 1 to that of the FTPRL-Scale algorithm.
Assume WLOG that $\max_i a_i = 1$, implying the diameter of $\mathcal{F}$ is 2. Let $g_1, \dots, g_T$ be the loss functions
for this instance. Recalling $G_i = \sqrt{\sum_{t=1}^T g_{t,i}^2}$, applying Corollary 1 directly to this problem gives

$$\text{Regret} \le 4 \sqrt{\sum_{i=1}^n G_i^2}. \tag{17}$$

This is the same as the bound obtained by online subgradient descent and related algorithms as well.

We now consider FTPRL-Scale, which uses the transformation of Theorem 5. Noting $D = 2$ for
the hypersphere and applying Corollary 1 to the transformed problem gives an adaptive scheme with

$$\text{Regret} \le 4 \sqrt{\sum_{i=1}^n \sum_{t=1}^T (a_i g_{t,i})^2}.$$

This bound is never worse than the bound of Equation (17), and can be arbitrarily better when many of
the $a_i$ are much smaller than 1.
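A small numerical comparison makes the gap concrete. The sketch below evaluates both bounds on a synthetic instance; the function names and the example data are assumptions, and the leading constant (4 here) cancels in the comparison.

```python
import math

def direct_bound(gradients, constant=4.0):
    """Bound of the form constant * sqrt(sum_i G_i^2) on the original instance."""
    return constant * math.sqrt(sum(v * v for g in gradients for v in g))

def scaled_bound(gradients, axes, constant=4.0):
    """Same bound after rescaling each coordinate i of every gradient by a_i."""
    total = sum((a * v) ** 2 for g in gradients for a, v in zip(axes, g))
    return constant * math.sqrt(total)

if __name__ == "__main__":
    axes = [1.0, 0.01, 0.01]                 # max_i a_i = 1, two very short axes
    grads = [[1.0, 1.0, 1.0]] * 100          # T = 100 identical gradients
    print(direct_bound(grads), scaled_bound(grads, axes))
```

Since every $a_i \le 1$, each term $(a_i g_{t,i})^2$ is at most $g_{t,i}^2$, so the scaled bound can never exceed the direct one; the saving grows with the number of coordinates whose $a_i$ is small.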
4 Related work



In the batch convex optimization setting, it is well known that convergence rates can often be dramatically
improved through the use of preconditioning, accomplished by an appropriate change of coordinates
taking into account both the shape of the objective function and the feasible region [Boyd and Vandenberghe,
2004]. To our knowledge, this is the first work that extends these concepts (necessarily in a quite different
form) to the problem of online convex optimization, where they can provide a powerful tool for
improving regret (the online analogue of convergence rates).

Perhaps the closest algorithms in spirit to our diagonal adaptation algorithm are confidence-weighted
linear classification [Drezde et al., 2008] and AROW [Crammer et al., 2009], in that they make different-sized
adjustments for different coordinates. Unlike our algorithm, these algorithms apply only to classification
problems and not to general online convex optimization, and the guarantees are in the form of
mistake bounds rather than regret bounds.

FTPRL is similar to the lazily-projected gradient descent algorithm of [Zinkevich, 2004, Sec. 5.2.3],
but with a critical difference: the latter effectively centers regularization outside of the current feasible
region (at $u_t$ rather than $x_t$). As a consequence, lazily-projected gradient descent only attains low regret
via a re-starting mechanism or a constant learning rate (chosen with knowledge of $T$). It is our technique
of always centering additional regularization inside the feasible set that allows us to make guarantees
for adaptively-chosen regularization.

Most recent state-of-the-art algorithms for online learning are in fact general algorithms for online
convex optimization applied to learning problems. Many of these algorithms can be thought of as
(significant) extensions of online subgradient descent, including [Duchi and Singer, 2009, Do et al.,
2009, Shalev-Shwartz et al., 2007]. Apart from the very general work of [Kalai and Vempala, 2005],
few general follow-the-regularized-leader algorithms have been analyzed, with the notable exception of
the recent work of Xiao [2009].

The notion of proving competitive ratios for regret bounds that are functions of regularization
parameters is not unique to this paper. Bartlett et al. [2008] and Do et al. [2009] proved guarantees of this
form, but for a different algorithm and class of regularization parameters.

In concurrent work [Streeter and McMahan, 2010], the authors proved bounds similar to those of
Corollary 2 for online gradient descent with per-coordinate learning rates. These results were significantly
less general than the ones presented here, and in particular were restricted to the case where $\mathcal{F}$
was exactly a hyperrectangle. The FTPRL algorithm and bounds proved in this paper hold for arbitrary
feasible sets, with the bound depending on the shape of the feasible set as well as the width along
each dimension. Some results similar to those in this work were developed concurrently by Duchi et al.
[2010], though for a different algorithm and using different analysis techniques.



5 Conclusions 

In this work, we analyzed a new algorithm for online convex optimization, which takes ideas from both
online subgradient descent and follow-the-regularized-leader. In our analysis of this algorithm,
we show that the learning rates that occur in standard bounds can be replaced by positive semidefinite 
matrices. The extra degrees of freedom offered by these generalized learning rates provide the key to 
proving better regret bounds. We characterized the types of feasible sets where this technique can lead 
to significant gains, and showed that while it does not help on the hypersphere, it can have dramatic 
impact when the feasible set is a hyperrectangle. 
The diagonal adaptation algorithm we introduced can be viewed as an incremental optimization of
the formula for the final bound on regret. In the case where the feasible set really is a hyperrectangle, 
this allows us to guarantee our final regret bound is within a small constant factor of the best bound 
that could have been obtained had the full problem been known in advance. The diagonal adaptation 
algorithm is efficient, and exploits exactly the kind of structure that is typical in large-scale real-world 
learning problems such as click-through rate prediction and text classification. 

Our work leaves open a number of interesting directions for future work, in particular the develop- 
ment of competitive algorithms for arbitrary feasible sets (without resorting to bounding norm-balls), 
and the development of algorithms that optimize over richer families of regularization functions. 
A A Proof of the FTRL Bound



In this section we provide a proof of Lemma 1. The high-level structure of our proof follows Kalai
and Vempala's analysis of the follow the perturbed leader algorithm, in that we prove bounds on three
quantities:

1. the regret of a hypothetical be the leader algorithm (BTL), which on round $t$ plays

$$x^*_t = \arg\min_{x \in \mathcal{F}} f_{1:t}(x),$$

2. the difference between the regret of BTL and that of the be the regularized leader algorithm
(BTRL), which plays

$$\hat{x}_t = \arg\min_{x \in \mathcal{F}} \big( r_{1:t}(x) + f_{1:t}(x) \big) = x_{t+1}, \tag{18}$$

and

3. the difference between the regret of BTRL and that of FTRL.

As shown in [Kalai and Vempala, 2005], the BTL algorithm has regret $\le 0$ even without any
restrictions on the loss functions or the feasible set. The proof is a straightforward induction, which we
reproduce here for completeness.



Lemma 6 ([Kalai and Vempala, 2005]). Let $f_1, f_2, \dots, f_T$ be an arbitrary sequence of functions, and
let $\mathcal{F}$ be an arbitrary set. Define $x^*_t = \arg\min_{x \in \mathcal{F}} \sum_{\tau=1}^t f_\tau(x)$. Then

$$\sum_{t=1}^T f_t(x^*_t) \le \sum_{t=1}^T f_t(x^*_T).$$

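Proof. We prove this by induction on $T$. For $T = 1$ it is trivially true. Suppose that it holds for $T - 1$.
Then

$$\begin{aligned}
\sum_{t=1}^T f_t(x^*_t) &= f_T(x^*_T) + \sum_{t=1}^{T-1} f_t(x^*_t) \\
&\le f_T(x^*_T) + \sum_{t=1}^{T-1} f_t(x^*_{T-1}) && \text{(Induction hypothesis)} \\
&\le f_T(x^*_T) + \sum_{t=1}^{T-1} f_t(x^*_T) && \text{(Definition of } x^*_{T-1}\text{)} \\
&= \sum_{t=1}^T f_t(x^*_T).
\end{aligned}$$

□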
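Lemma 6 places no restriction on the functions or the set, so it can be exercised on an arbitrary finite instance. The sketch below is illustrative, with hypothetical names: it represents each $f_t$ by a row of values over a finite feasible set, lets BTL play the leader of the cumulative loss including round $t$, and compares its total loss with that of the best fixed point.

```python
import random

def btl_versus_best_fixed(loss_table):
    """loss_table[t][k] is f_t evaluated at the k-th feasible point.

    Returns (total loss of be-the-leader, total loss of the best fixed point).
    """
    n_points = len(loss_table[0])
    cumulative = [0.0] * n_points
    btl_total = 0.0
    for row in loss_table:
        # BTL sees f_t before playing, so update the cumulative loss first.
        cumulative = [c + v for c, v in zip(cumulative, row)]
        leader = min(range(n_points), key=cumulative.__getitem__)
        btl_total += row[leader]
    return btl_total, min(cumulative)

def random_loss_table(rounds, points, seed=0):
    """Arbitrary (not necessarily convex) losses in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(points)] for _ in range(rounds)]

if __name__ == "__main__":
    btl, best = btl_versus_best_fixed(random_loss_table(50, 7))
    print(btl, best, btl <= best + 1e-9)
```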



We next prove a bound on the regret of BTRL. 



Lemma 7. Let $r_1, r_2, \dots, r_T$ be a sequence of non-negative functions. Then BTRL, which on round $t$
plays $\hat{x}_t$ as defined by Equation (18), has regret at most $r_{1:T}(\mathring{x})$, where $\mathring{x}$ is the post-hoc optimal solution.
Proof. Define $f'_t(x) = f_t(x) + r_t(x)$. Observe that $\hat{x}_t = \arg\min_{x \in \mathcal{F}} f'_{1:t}(x)$. Thus, by Lemma 6, we
have

$$\sum_{t=1}^T f'_t(\hat{x}_t) \le \min_{x \in \mathcal{F}} f'_{1:T}(x) \le f'_{1:T}(\mathring{x}),$$

or equivalently,

$$\sum_{t=1}^T \big( f_t(\hat{x}_t) + r_t(\hat{x}_t) \big) \le r_{1:T}(\mathring{x}) + f_{1:T}(\mathring{x}).$$

Dropping the non-negative $r_t(\hat{x}_t)$ terms on the left hand side proves the lemma. □

By definition, the total loss of FTRL (which plays $x_t$) exceeds that of BTRL (which plays $\hat{x}_t =
x_{t+1}$) by $\sum_{t=1}^T \big( f_t(x_t) - f_t(x_{t+1}) \big)$. Putting these facts together proves Lemma 1.

B Proof of Lemma 5

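Proof. The lemma is clearly true for $n = 1$. Fix some $n$, and assume the lemma holds for $n - 1$. Thus,

$$\sum_{i=1}^n \frac{x_i}{\sqrt{\sum_{j=1}^i x_j}} \le 2 \sqrt{\sum_{i=1}^{n-1} x_i} + \frac{x_n}{\sqrt{\sum_{j=1}^n x_j}} = 2\sqrt{Z - x} + \frac{x}{\sqrt{Z}},$$

where we define $Z = \sum_{i=1}^n x_i$ and $x = x_n$. The derivative of the right hand side with respect to $x$ is
$\frac{-1}{\sqrt{Z - x}} + \frac{1}{\sqrt{Z}}$, which is negative for $x > 0$. Thus, subject to the constraint $x \ge 0$, the right hand side is
maximized at $x = 0$, and is therefore at most $2\sqrt{Z}$. □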
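The inequality can also be spot-checked numerically. The sketch below assumes the statement of Lemma 5 is $\sum_{i=1}^n x_i / \sqrt{\sum_{j \le i} x_j} \le 2\sqrt{\sum_{i=1}^n x_i}$ for non-negative $x_i$ (the form used in the induction step above), and treats a term with a zero prefix sum as zero; the function names are hypothetical.

```python
import math
import random

def lemma5_sides(xs):
    """Return (left-hand side, right-hand side) of the assumed Lemma 5 inequality."""
    prefix = 0.0
    lhs = 0.0
    for x in xs:
        prefix += x
        if prefix > 0.0:          # convention: a 0/0 term contributes nothing
            lhs += x / math.sqrt(prefix)
    return lhs, 2.0 * math.sqrt(prefix)

def random_nonnegative(n, seed=0):
    """Non-negative test values spanning several orders of magnitude."""
    rng = random.Random(seed)
    return [10.0 ** rng.uniform(-3.0, 3.0) for _ in range(n)]

if __name__ == "__main__":
    lhs, rhs = lemma5_sides(random_nonnegative(1000))
    print(lhs, rhs, lhs <= rhs)
```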